Marcelino Mayorga Quesada

1. Competition Info (July)

- Link: https://www.kaggle.com/competitions/tabular-playground-series-jul-2022/

- Info: "In this challenge, you are given a dataset where each row belongs to a particular cluster. Your job is to predict the cluster each row belongs to. You are not given any training data, and you are not told how many clusters are found in the ground truth labels."

- Evaluation: "Submissions are evaluated on the Adjusted Rand Index between the ground truth cluster labels of the data and your predicted cluster labels. You are not given the number of ground truth clusters or any training labels. This is a completely unsupervised problem"
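Because the Adjusted Rand Index compares groupings rather than label values, the cluster ids in a submission do not need to match the ids used in the ground truth. A minimal sketch of that property, using scikit-learn's `adjusted_rand_score` on made-up toy labels (the label lists below are illustrative, not competition data):

```python
from sklearn.metrics import adjusted_rand_score

# Toy ground truth: three clusters of two rows each (illustrative only).
true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping as true_labels, but with the cluster ids permuted.
permuted = [2, 2, 0, 0, 1, 1]

# A partly wrong grouping: the fourth row is assigned to the wrong cluster.
partly_wrong = [0, 0, 1, 2, 2, 2]

ari_permuted = adjusted_rand_score(true_labels, permuted)
ari_partial = adjusted_rand_score(true_labels, partly_wrong)

print("permuted ids:", ari_permuted)
print("partly wrong:", ari_partial)
```

The permuted labelling scores as a perfect match, while the partly wrong one scores lower.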

2. Clustering Overview

azminetoushikwasi provides a great explanation of the different clustering techniques and algorithms.

3. Pipelines

4. Data Pipeline

Imports

GetData

Exploratory Data Analysis (EDA)

High Level Data Information

Shape

Info

Describe

Null / NA

Correlation

Column Histograms

Summary

Prepare Data

Remove ID Column

Preprocessing

Optimal Cluster count with Distortion Elbow Method (Kmeans Clustering)

Yellowbrick provides a wrapper that computes and plots the distortion score for each candidate cluster count:
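The idea behind the elbow method can be sketched without Yellowbrick. The example below is a plain scikit-learn stand-in, not the notebook's Yellowbrick code: it fits KMeans for a range of `k` values and records `inertia_` (the sum of squared distances from each point to its assigned center) as the distortion. The synthetic `make_blobs` data and the range 2 to 8 are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 4 blobs (assumption; not the competition data).
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

# Fit KMeans for each candidate k and record the distortion (inertia).
distortions = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions[k] = model.inertia_

# The "elbow" is the k after which the distortion stops dropping sharply.
for k, distortion in distortions.items():
    print(k, round(distortion, 1))
```

Reading the printed values top to bottom, the point where the decrease flattens out suggests a reasonable cluster count.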

Class / Cluster InterDistance

Some clusters overlap with each other, which adds noise and makes them harder to identify correctly.

Approaches

Data Scaling
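The outline does not say which scaler is used, so the following is only a sketch under the assumption of standardization with scikit-learn's `StandardScaler`. It shows why scaling matters for distance-based and mixture-based clustering: features on very different scales are brought to zero mean and unit variance. The two synthetic columns are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two illustrative features on very different scales.
X = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=200),
    rng.normal(loc=50.0, scale=1000.0, size=200),
])

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print("before:", X.std(axis=0))
print("after: ", X_scaled.std(axis=0))
```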

BayesianGaussianMixture - Raw cluster prediction
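A minimal sketch of a raw cluster prediction with scikit-learn's `BayesianGaussianMixture`. The synthetic data, `n_components=3`, and the other hyperparameters are assumptions for illustration; in the notebook the component count would come from the elbow analysis.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in data (assumption; not the competition data).
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# n_components=3 is a hypothetical choice for this toy data.
bgm = BayesianGaussianMixture(
    n_components=3,
    covariance_type="full",
    max_iter=300,
    random_state=0,
)

# Hard cluster assignment for every row.
labels = bgm.fit_predict(X)

# Soft assignment: one probability per component for every row.
proba = bgm.predict_proba(X)

print(np.bincount(labels))
print(proba[:3].round(3))
```

The per-row probabilities are what make it possible to separate confident from non-confident predictions later on.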

Identify Class Classification Difficulties

Prepare Data

Class / Cluster Balance (SMOTE)

Because the next step is supervised, the classes need to be balanced before training.

Supervised Learning Cross Validation

Prepare Data

Train Models with confident predictions

Evaluate Models with non-confident predictions

Inference

Predict with all

Predict with only non_confident
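The flow from "Train Models with confident predictions" through "Predict with only non_confident" can be sketched roughly as follows. This is a sketch under stated assumptions, not the notebook's actual code: the synthetic data, the `THRESHOLD` value of 0.9, and the choice of `ExtraTreesClassifier` are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.mixture import BayesianGaussianMixture

# Overlapping synthetic blobs (assumption; not the competition data).
X, _ = make_blobs(
    n_samples=600,
    centers=[[-5, 0], [0, 5], [5, 0]],
    cluster_std=2.0,
    random_state=0,
)

# Step 1: raw cluster probabilities from the mixture model.
bgm = BayesianGaussianMixture(n_components=3, max_iter=300, random_state=0).fit(X)
proba = bgm.predict_proba(X)
pseudo_labels = proba.argmax(axis=1)

# Step 2: split rows by how sure the mixture model is (hypothetical threshold).
THRESHOLD = 0.9
confident = proba.max(axis=1) >= THRESHOLD

# Step 3: train a classifier on the confident rows only,
# using the mixture labels as pseudo-labels.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X[confident], pseudo_labels[confident])

# Step 4: keep the confident labels, re-predict only the non-confident rows.
final_labels = pseudo_labels.copy()
if (~confident).any():
    final_labels[~confident] = clf.predict(X[~confident])

changed = int((final_labels != pseudo_labels).sum())
print("confident rows:", int(confident.sum()), "relabelled rows:", changed)
```

Predicting with the classifier on all rows instead of only the non-confident ones is a one-line change (`clf.predict(X)`), which corresponds to the "Predict with all" variant.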

Improvements

Submission

Predictions

PCA
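A common use of PCA at this stage is to project the features onto two components so the predicted clusters can be inspected in a scatter plot. The outline does not show how PCA is applied here, so the sketch below is an assumption; the synthetic data and the placeholder `labels` array stand in for the real features and predictions.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 10-dimensional stand-in data with placeholder cluster labels.
X, labels = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Project onto the first two principal components.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)

# Share of the total variance retained by the 2D projection.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, round(float(explained), 3))
```

The two columns of `X_2d` can then be plotted against each other, colored by `labels`, to check visually how well the predicted clusters separate.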
